Scaling laws and fluctuations in the statistics of word frequencies

Authors

  • Martin Gerlach
  • Eduardo G. Altmann
Abstract

In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps’ law), we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words with simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf’s law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, so that the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results on three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles).
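
As a rough illustration of how a fat-tailed frequency distribution can produce sublinear vocabulary growth, the following sketch covers only the homogeneous baseline mentioned in the abstract: every word keeps one fixed frequency, with no fluctuations across documents. It is not the authors' code. The function names, the finite number of word types, and the exponent `gamma` are assumptions chosen for illustration. Under these assumptions, if word r occurs a Poisson-distributed number of times with mean N·f_r, the expected number of distinct words in a text of N tokens is the sum over r of 1 − exp(−N·f_r).

```python
import math
import random

def zipf_frequencies(n_types, gamma=1.0):
    """Normalized frequencies f_r proportional to r**(-gamma), r = 1..n_types.

    n_types and gamma are illustrative assumptions, not values from the paper.
    """
    weights = [r ** -gamma for r in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_vocabulary(freqs, n_tokens):
    """Mean number of distinct words when word r occurs Poisson(n_tokens * f_r) times."""
    return sum(1.0 - math.exp(-n_tokens * f) for f in freqs)

def sampled_vocabulary(freqs, n_tokens, rng):
    """Distinct words in one synthetic text of n_tokens independent draws.

    Drawing a fixed number of tokens is a close stand-in for the Poisson
    picture used in expected_vocabulary, not an exact match.
    """
    draws = rng.choices(range(len(freqs)), weights=freqs, k=n_tokens)
    return len(set(draws))

freqs = zipf_frequencies(n_types=50_000, gamma=1.0)
rng = random.Random(0)
for n_tokens in (1_000, 10_000, 100_000):
    mean_v = expected_vocabulary(freqs, n_tokens)
    one_v = sampled_vocabulary(freqs, n_tokens, rng)
    print(n_tokens, round(mean_v), one_v)
```

Each tenfold increase in text length raises the vocabulary by less than a factor of ten, which is the qualitative content of Heaps’ law. Capturing the reduced mean and the enlarged variance described in the abstract would require letting each word's frequency vary from document to document, which this sketch deliberately leaves out.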

Journal:
  • CoRR

Volume abs/1406.4441

Publication date 2014